The stock market has consistently proven to be a good place to invest and save for the future. There are many compelling reasons to invest in stocks: they can help fight inflation, build wealth, and provide certain tax benefits. Steady returns earned over a long period can grow far more than seems possible, and thanks to the power of compounding, the earlier one starts investing, the larger the corpus one can accumulate for retirement. Overall, investing in stocks can help meet life's financial aspirations.
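The effect of compounding can be sketched with the future-value formula FV = P × ((1 + r)^n − 1) / r for a recurring annual investment. The contribution amount and the 8% return below are hypothetical, chosen only to illustrate the head-start effect:

```python
def future_value(annual_payment, rate, years):
    """Future value of `annual_payment` invested at the end of each year,
    compounding at `rate` per year: FV = P * ((1 + r)**n - 1) / r."""
    return annual_payment * ((1 + rate) ** years - 1) / rate

# Hypothetical: $5,000/year at an assumed 8% annual return, held until age 65
early = future_value(5000, 0.08, 40)  # starting at age 25
late = future_value(5000, 0.08, 30)   # starting at age 35
print(f"Start at 25: ${early:,.0f}")
print(f"Start at 35: ${late:,.0f}")
```

Under these assumptions, a ten-year head start more than doubles the final corpus.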
It is important to maintain a diversified portfolio when investing in stocks in order to maximise earnings under any market condition. A diversified portfolio tends to yield higher returns at lower risk by tempering potential losses when the market is down. It is easy to get lost in the sea of financial metrics used to judge the worth of a stock, and analyzing them for a multitude of stocks to identify the right picks for an individual is a tedious task. Cluster analysis can identify stocks that exhibit similar characteristics as well as stocks with minimal correlation to one another. This helps investors analyze stocks across different market segments and guards against risks that could make the portfolio vulnerable to losses.
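The correlation part of that idea can be sketched in a few lines. The daily return series below are synthetic stand-ins, not values from the Trade&Ahead dataset; low or negative off-diagonal entries in the matrix would flag pairs that diversify each other:

```python
import numpy as np
import pandas as pd

# Synthetic daily returns for three hypothetical stocks
rng = np.random.default_rng(0)
returns = pd.DataFrame({
    "stock_a": rng.normal(0, 0.01, 250),
    "stock_b": rng.normal(0, 0.01, 250),
    "stock_c": rng.normal(0, 0.02, 250),
})

# Pairwise correlation matrix: values near +1 mean the stocks move together,
# values near 0 (or negative) suggest diversification benefits
corr = returns.corr()
print(corr.round(2))
```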
Trade&Ahead is a financial consultancy firm that provides its customers with personalized investment strategies. They have hired you as a Data Scientist and provided data comprising stock prices and some financial indicators for a few companies listed on the New York Stock Exchange. Your tasks are to analyze the data, group the stocks based on the attributes provided, and share insights about the characteristics of each group.
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='darkgrid')
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# to scale the data using z-score
from sklearn.preprocessing import StandardScaler
# to compute distances
from scipy.spatial.distance import cdist, pdist
# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
# to suppress warnings
import warnings
warnings.filterwarnings("ignore")
# let colab access my google drive
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
url = "/content/drive/MyDrive/GL/Unsupervised Learning/Project - Trade and Ahead/stock_data.csv"
# Import data
data = pd.read_csv(url)
The first step in getting an overview of any dataset is to check its dimensions:
data.shape
(340, 15)
# viewing a random sample of the dataset
data.sample(n=10, random_state=1)
| | Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 102 | DVN | Devon Energy Corp. | Energy | Oil & Gas Exploration & Production | 32.000000 | -15.478079 | 2.923698 | 205 | 70 | 830000000 | -14454000000 | -35.55 | 4.065823e+08 | 93.089287 | 1.785616 |
| 125 | FB | Facebook | Information Technology | Internet Software & Services | 104.660004 | 16.224320 | 1.320606 | 8 | 958 | 592000000 | 3669000000 | 1.31 | 2.800763e+09 | 79.893133 | 5.884467 |
| 11 | AIV | Apartment Investment & Mgmt | Real Estate | REITs | 40.029999 | 7.578608 | 1.163334 | 15 | 47 | 21818000 | 248710000 | 1.52 | 1.636250e+08 | 26.335526 | -1.269332 |
| 248 | PG | Procter & Gamble | Consumer Staples | Personal Products | 79.410004 | 10.660538 | 0.806056 | 17 | 129 | 160383000 | 636056000 | 3.28 | 4.913916e+08 | 24.070121 | -2.256747 |
| 238 | OXY | Occidental Petroleum | Energy | Oil & Gas Exploration & Production | 67.610001 | 0.865287 | 1.589520 | 32 | 64 | -588000000 | -7829000000 | -10.23 | 7.652981e+08 | 93.089287 | 3.345102 |
| 336 | YUM | Yum! Brands Inc | Consumer Discretionary | Restaurants | 52.516175 | -8.698917 | 1.478877 | 142 | 27 | 159000000 | 1293000000 | 2.97 | 4.353535e+08 | 17.682214 | -3.838260 |
| 112 | EQT | EQT Corporation | Energy | Oil & Gas Exploration & Production | 52.130001 | -21.253771 | 2.364883 | 2 | 201 | 523803000 | 85171000 | 0.56 | 1.520911e+08 | 93.089287 | 9.567952 |
| 147 | HAL | Halliburton Co. | Energy | Oil & Gas Equipment & Services | 34.040001 | -5.101751 | 1.966062 | 4 | 189 | 7786000000 | -671000000 | -0.79 | 8.493671e+08 | 93.089287 | 17.345857 |
| 89 | DFS | Discover Financial Services | Financials | Consumer Finance | 53.619999 | 3.653584 | 1.159897 | 20 | 99 | 2288000000 | 2297000000 | 5.14 | 4.468872e+08 | 10.431906 | -0.375934 |
| 173 | IVZ | Invesco Ltd. | Financials | Asset Management & Custody Banks | 33.480000 | 7.067477 | 1.580839 | 12 | 67 | 412000000 | 968100000 | 2.26 | 4.283628e+08 | 14.814159 | 4.218620 |
# copying the data to another variable to avoid any changes to original data
df = data.copy()
# checking datatypes and number of non-null values for each column
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   Ticker Symbol                 340 non-null    object
 1   Security                      340 non-null    object
 2   GICS Sector                   340 non-null    object
 3   GICS Sub Industry             340 non-null    object
 4   Current Price                 340 non-null    float64
 5   Price Change                  340 non-null    float64
 6   Volatility                    340 non-null    float64
 7   ROE                           340 non-null    int64
 8   Cash Ratio                    340 non-null    int64
 9   Net Cash Flow                 340 non-null    int64
 10  Net Income                    340 non-null    int64
 11  Earnings Per Share            340 non-null    float64
 12  Estimated Shares Outstanding  340 non-null    float64
 13  P/E Ratio                     340 non-null    float64
 14  P/B Ratio                     340 non-null    float64
dtypes: float64(7), int64(4), object(4)
memory usage: 40.0+ KB
Observations:
# dropping the Ticker Symbol column, as it is just an identifier and adds no information for clustering
df.drop("Ticker Symbol", axis=1, inplace=True)
# convert all columns with dtype object into category
for col in df.columns[df.dtypes == 'object']:
    df[col] = df[col].astype('category')
# verifying results
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 14 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   Security                      340 non-null    category
 1   GICS Sector                   340 non-null    category
 2   GICS Sub Industry             340 non-null    category
 3   Current Price                 340 non-null    float64
 4   Price Change                  340 non-null    float64
 5   Volatility                    340 non-null    float64
 6   ROE                           340 non-null    int64
 7   Cash Ratio                    340 non-null    int64
 8   Net Cash Flow                 340 non-null    int64
 9   Net Income                    340 non-null    int64
 10  Earnings Per Share            340 non-null    float64
 11  Estimated Shares Outstanding  340 non-null    float64
 12  P/E Ratio                     340 non-null    float64
 13  P/B Ratio                     340 non-null    float64
dtypes: category(3), float64(7), int64(4)
memory usage: 46.7 KB
# summary of categorical columns
df.describe(include='category').T
| count | unique | top | freq | |
|---|---|---|---|---|
| Security | 340 | 340 | 3M Company | 1 |
| GICS Sector | 340 | 11 | Industrials | 53 |
| GICS Sub Industry | 340 | 104 | Oil & Gas Exploration & Production | 16 |
Observations:
# summary of numerical columns
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Current Price | 340.0 | 8.086234e+01 | 9.805509e+01 | 4.500000e+00 | 3.855500e+01 | 5.970500e+01 | 9.288000e+01 | 1.274950e+03 |
| Price Change | 340.0 | 4.078194e+00 | 1.200634e+01 | -4.712969e+01 | -9.394838e-01 | 4.819505e+00 | 1.069549e+01 | 5.505168e+01 |
| Volatility | 340.0 | 1.525976e+00 | 5.917984e-01 | 7.331632e-01 | 1.134878e+00 | 1.385593e+00 | 1.695549e+00 | 4.580042e+00 |
| ROE | 340.0 | 3.959706e+01 | 9.654754e+01 | 1.000000e+00 | 9.750000e+00 | 1.500000e+01 | 2.700000e+01 | 9.170000e+02 |
| Cash Ratio | 340.0 | 7.002353e+01 | 9.042133e+01 | 0.000000e+00 | 1.800000e+01 | 4.700000e+01 | 9.900000e+01 | 9.580000e+02 |
| Net Cash Flow | 340.0 | 5.553762e+07 | 1.946365e+09 | -1.120800e+10 | -1.939065e+08 | 2.098000e+06 | 1.698108e+08 | 2.076400e+10 |
| Net Income | 340.0 | 1.494385e+09 | 3.940150e+09 | -2.352800e+10 | 3.523012e+08 | 7.073360e+08 | 1.899000e+09 | 2.444200e+10 |
| Earnings Per Share | 340.0 | 2.776662e+00 | 6.587779e+00 | -6.120000e+01 | 1.557500e+00 | 2.895000e+00 | 4.620000e+00 | 5.009000e+01 |
| Estimated Shares Outstanding | 340.0 | 5.770283e+08 | 8.458496e+08 | 2.767216e+07 | 1.588482e+08 | 3.096751e+08 | 5.731175e+08 | 6.159292e+09 |
| P/E Ratio | 340.0 | 3.261256e+01 | 4.434873e+01 | 2.935451e+00 | 1.504465e+01 | 2.081988e+01 | 3.176476e+01 | 5.280391e+02 |
| P/B Ratio | 340.0 | -1.718249e+00 | 1.396691e+01 | -7.611908e+01 | -4.352056e+00 | -1.067170e+00 | 3.917066e+00 | 1.290646e+02 |
Observations:
Questions:
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(df, feature, figsize=(10, 5), kde=False, bins=None):
    """
    Boxplot and histogram combined
    df: dataframe
    feature: dataframe column
    figsize: size of figure (default (10, 5))
    kde: whether to show the density curve (default False)
    bins: number of bins for the histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=df, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    if bins:
        sns.histplot(data=df, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=df, x=feature, kde=kde, ax=ax_hist2)  # for histogram
    ax_hist2.axvline(
        df[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        df[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
# function to create labeled barplots
def labeled_barplot(df, feature, perc=False, n=None):
    """
    Barplot with percentage at the top
    df: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of counts (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(df[feature])  # length of the column
    count = df[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=df,
        x=feature,
        palette="Paired",
        order=df[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # x-coordinate of the bar's center
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count/percentage
    plt.show()  # show the plot
Categorical Data
# GICS Sector
labeled_barplot(df, 'GICS Sector', perc=True)
# Calculate the top four sectors (those with > 10%) by number of stocks
top_four = df["GICS Sector"].value_counts().head(n=4)
# Calculate the cumulative percentage
top_four_cum = (df["GICS Sector"].value_counts(normalize=True).head(n=4).sum()) * 100
# Print the results
print(f"The top 4 sectors (by number of stocks) are:\n\n{top_four}\n")
print(f"These top 4 sectors account for {top_four_cum:.2f}% of the data.\n")
The top 4 sectors (by number of stocks) are:

Industrials               53
Financials                49
Consumer Discretionary    40
Health Care               40
Name: GICS Sector, dtype: int64

These top 4 sectors account for 53.53% of the data.
Observations on the economic sector classification:
# GICS Sub Industry
labeled_barplot(df, 'GICS Sub Industry', perc=True)
# Calculate the number of subsectors
total_sub = df["GICS Sub Industry"].nunique()
print(f"[NOTE: The total number of sub-sectors: {total_sub}]\n")
# Calculate the top seven sub-sectors (those with > 3%) by number of stocks
top_seven_sub = df["GICS Sub Industry"].value_counts().head(n=7)
# Calculate the cumulative percentage
top_seven_sub_cum = (df["GICS Sub Industry"].value_counts(normalize=True).head(n=7).sum()) * 100
# Print the results
print(f"The top 7 sub-sectors (by number of stocks) are:\n\n{top_seven_sub}\n")
print(f"These top 7 sub-sectors account for {top_seven_sub_cum:.2f}% of the data.\n")
[NOTE: The total number of sub-sectors: 104]

The top 7 sub-sectors (by number of stocks) are:

Oil & Gas Exploration & Production    16
REITs                                 14
Industrial Conglomerates              14
Internet Software & Services          12
Electric Utilities                    12
MultiUtilities                        11
Health Care Equipment                 11
Name: GICS Sub Industry, dtype: int64

These top 7 sub-sectors account for 26.47% of the data.
Observations on the economic sub-sector classification:
Numerical Data
# Current Price
histogram_boxplot(df, 'Current Price')
Observations on Current Price:
# Price Change
histogram_boxplot(df, 'Price Change')
Observations on Price Change:
# Volatility
histogram_boxplot(df, 'Volatility')
Observations on Volatility:
# ROE
histogram_boxplot(df, 'ROE')
Observations on Return on Equity:
# Cash Ratio
histogram_boxplot(df, 'Cash Ratio')
Observations on Cash Ratio:
# Net Cash Flow
histogram_boxplot(df, 'Net Cash Flow')
Observations on Net Cash Flow:
# Net Income
histogram_boxplot(df, 'Net Income')
Observations on Net Income:
# Earnings Per Share
histogram_boxplot(df, 'Earnings Per Share')
Observations on Earnings Per Share:
# Estimated Shares Outstanding
histogram_boxplot(df, 'Estimated Shares Outstanding')
Observations on Estimated Shares Outstanding:
# P/E Ratio
histogram_boxplot(df, 'P/E Ratio')
Observations on Price-to-Earnings Ratio:
# P/B Ratio
histogram_boxplot(df, 'P/B Ratio')
Observations on Price-to-Book Ratio:
Correlations between numerical variables
# correlation check
plt.figure(figsize=(8, 4))
sns.heatmap(
df.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="BrBG"
)
plt.show()
Observations:
Let's check which economic sector's stocks have seen the maximum price increase on average.
plt.figure(figsize=(12,5))
sns.barplot(data=df, x='GICS Sector', y='Price Change', ci=False)
plt.xticks(rotation=90)
plt.show()
df.groupby(['GICS Sector'])['Price Change'].mean().sort_values(ascending = False)
GICS Sector
Health Care                    9.585652
Consumer Staples               8.684750
Information Technology         7.217476
Telecommunications Services    6.956980
Real Estate                    6.205548
Consumer Discretionary         5.846093
Materials                      5.589738
Financials                     3.865406
Industrials                    2.833127
Utilities                      0.803657
Energy                       -10.228289
Name: Price Change, dtype: float64
Observations:
Cash ratio provides a measure of a company's ability to cover its short-term obligations using only cash and cash equivalents. Let's see how the average cash ratio varies across economic sectors.
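As a reminder, Cash Ratio = (cash + cash equivalents) / current liabilities. The figures below are hypothetical, purely to illustrate the formula (the values in this dataset appear to be reported on a larger scale):

```python
# Hypothetical balance-sheet figures (not from the dataset)
cash_and_equivalents = 4_000_000
current_liabilities = 10_000_000

# Cash Ratio = (cash + cash equivalents) / current liabilities
cash_ratio = cash_and_equivalents / current_liabilities
print(f"Cash Ratio: {cash_ratio:.2f}")  # Cash Ratio: 0.40
```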
plt.figure(figsize=(10,4))
sns.barplot(data=df, x='GICS Sector', y='Cash Ratio', ci=False)
plt.xticks(rotation=90)
plt.show()
df.groupby(['GICS Sector'])['Cash Ratio'].mean().sort_values(ascending = False)
GICS Sector
Information Technology         149.818182
Telecommunications Services    117.000000
Health Care                    103.775000
Financials                      98.591837
Consumer Staples                70.947368
Energy                          51.133333
Real Estate                     50.111111
Consumer Discretionary          49.575000
Materials                       41.700000
Industrials                     36.188679
Utilities                       13.625000
Name: Cash Ratio, dtype: float64
Observations:
The highest average Cash Ratios are seen in Information Technology sector (149.818), the Telecommunications Services sector (117.000), and the Health Care sector (103.775). Companies in these sectors often require ample cash reserves for research and development, investment in new technologies, potential legal liabilities, and adapting to rapidly changing market conditions.
The lowest average Cash Ratios are seen in the Utilities sector (13.625), the Industrials sector (36.189), and the Materials sector (41.700). Companies in these sectors often face upfront capital investments, maintenance, and regulatory compliance which may limit available cash for short-term obligations.
P/E ratios can help determine the relative value of a company's shares, as they signify the amount an investor is willing to pay for a single share of a company per dollar of its earnings. Let's see how the P/E ratio varies, on average, across economic sectors.
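A minimal sketch of the ratio itself, with hypothetical numbers:

```python
def pe_ratio(price, eps):
    """P/E ratio: current share price divided by earnings per share."""
    return price / eps

# Hypothetical: a $50 share earning $5 per share trades at 10x earnings
print(pe_ratio(50.0, 5.0))  # 10.0
```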
plt.figure(figsize=(10,4))
sns.barplot(data=df, x='GICS Sector', y='P/E Ratio', ci=False)
plt.xticks(rotation=90)
plt.show()
df.groupby('GICS Sector')['P/E Ratio'].mean().sort_values(ascending=False)
GICS Sector
Energy                         72.897709
Information Technology         43.782546
Real Estate                    43.065585
Health Care                    41.135272
Consumer Discretionary         35.211613
Consumer Staples               25.521195
Materials                      24.585352
Utilities                      18.719412
Industrials                    18.259380
Financials                     16.023151
Telecommunications Services    12.222578
Name: P/E Ratio, dtype: float64
Observations:
The sectors with the highest P/E ratios are the Energy sector (72.898), Information Technology sector (43.783), the Real Estate sector (43.066), and the Health Care sector (41.135). This can likely be attributed to expectations of strong earnings growth, favorable industry dynamics, profit margins, investor sentiment, and valuation comparisons relative to other sectors.
The sectors with the lowest P/E ratios are the Telecommunications Services sector (12.223), the Financials sector (16.023), Industrials sector (18.259), and the Utilities sector (18.719). This can likely be attributed to factors such as growth expectations, regulatory constraints, economic sensitivity, profit margins, risk perception, and market dynamics specific to each sector.
Volatility accounts for the fluctuation in the stock price. A stock with high volatility will witness sharper price changes, making it a riskier investment. Let's see how volatility varies, on average, across economic sectors.
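The dataset does not document exactly how Volatility was computed; one common definition, sketched below with hypothetical prices, is the standard deviation of periodic returns:

```python
import numpy as np

# Hypothetical closing prices (not from the dataset)
prices = np.array([100.0, 102.0, 101.0, 105.0, 103.0])

# Simple periodic returns, then their sample standard deviation
returns = np.diff(prices) / prices[:-1]
volatility = returns.std(ddof=1) * 100  # expressed in percent
print(f"Volatility: {volatility:.2f}%")
```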
plt.figure(figsize=(10,4))
sns.barplot(data=df, x='GICS Sector', y='Volatility', ci=False)
plt.xticks(rotation=90)
plt.show()
df.groupby(['GICS Sector'])['Volatility'].mean().sort_values(ascending = False)
GICS Sector
Energy                         2.568777
Materials                      1.816726
Information Technology         1.659801
Consumer Discretionary         1.595478
Health Care                    1.541023
Industrials                    1.416989
Telecommunications Services    1.341612
Financials                     1.267255
Real Estate                    1.206053
Consumer Staples               1.152675
Utilities                      1.118018
Name: Volatility, dtype: float64
Observations:
The sectors with the highest stock price volatility are the Energy sector (2.569) and the Materials sector (1.817). This can likely be attributed to their sensitivity to commodity price fluctuations, global economic conditions, regulatory factors, market dynamics, and supply chain disruptions.
The sectors with the lowest stock price volatility are the Utilities sector (1.118), the Consumer Staples sector (1.153), and the Real Estate sector (1.206). This can likely be attributed to stable demand patterns, regulatory stability, income-generating properties, and resilience to economic downturns.
# Check for duplicate values
df.duplicated().sum()
0
# Check for missing values in the data
df.isna().sum()
Security                        0
GICS Sector                     0
GICS Sub Industry               0
Current Price                   0
Price Change                    0
Volatility                      0
ROE                             0
Cash Ratio                      0
Net Cash Flow                   0
Net Income                      0
Earnings Per Share              0
Estimated Shares Outstanding    0
P/E Ratio                       0
P/B Ratio                       0
dtype: int64
# Plot the boxplots of all numerical columns to check for outliers.
plt.figure(figsize=(12, 10))
numeric_columns = df.select_dtypes(include=np.number).columns.tolist()
for i, variable in enumerate(numeric_columns):
    plt.subplot(3, 4, i + 1)
    plt.boxplot(df[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
# scaling the data before clustering
scaler = StandardScaler()
subset = df[numeric_columns].copy()
subset_scaled = scaler.fit_transform(subset)
# creating a dataframe of the scaled data
subset_scaled_df = pd.DataFrame(subset_scaled, columns=subset.columns)
# checking the new dataframe
subset_scaled_df.head()
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.393341 | 0.493950 | 0.272749 | 0.989601 | -0.210698 | -0.339355 | 1.554415 | 1.309399 | 0.107863 | -0.652487 | -0.506653 |
| 1 | -0.220837 | 0.355439 | 1.137045 | 0.937737 | 0.077269 | -0.002335 | 0.927628 | 0.056755 | 1.250274 | -0.311769 | -0.504205 |
| 2 | -0.367195 | 0.602479 | -0.427007 | -0.192905 | -0.033488 | 0.454058 | 0.744371 | 0.024831 | 1.098021 | -0.391502 | 0.094941 |
| 3 | 0.133567 | 0.825696 | -0.284802 | -0.317379 | 1.218059 | -0.152497 | -0.219816 | -0.230563 | -0.091622 | 0.947148 | 0.424333 |
| 4 | -0.260874 | -0.492636 | 0.296470 | -0.265515 | 2.237018 | 0.133564 | -0.202703 | -0.374982 | 1.978399 | 3.293307 | 0.199196 |
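StandardScaler applies the z-score transform z = (x − μ) / σ to each column, using the population standard deviation (ddof=0). A quick sanity check on a toy column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-column data (illustrative only)
x = np.array([[1.0], [2.0], [3.0], [4.0]])
scaled = StandardScaler().fit_transform(x)

# Same result by hand: z = (x - mean) / population std
manual = (x - x.mean()) / x.std()  # np.std defaults to ddof=0, like StandardScaler
assert np.allclose(scaled, manual)

# The scaled column has mean 0 and standard deviation 1
print(scaled.ravel())
```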
#create pairplot for scaled dataframe
sns.pairplot(subset_scaled_df, height=2, aspect=2, diag_kind='kde')
plt.show()
Observations:
# copy scaled dataframe
k_means_df = subset_scaled_df.copy()
# Find the average distortion in different potential clusters
clusters = range(1, 15)
meanDistortions = []
for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(subset_scaled_df)
    prediction = model.predict(subset_scaled_df)
    distortion = (
        sum(
            np.min(cdist(subset_scaled_df, model.cluster_centers_, "euclidean"), axis=1)
        )
        / subset_scaled_df.shape[0]
    )
    meanDistortions.append(distortion)
    print("Number of Clusters:", k, "\tAverage Distortion:", distortion)
Number of Clusters: 1 	Average Distortion: 2.5425069919221697
Number of Clusters: 2 	Average Distortion: 2.382318498894466
Number of Clusters: 3 	Average Distortion: 2.2603612005599585
Number of Clusters: 4 	Average Distortion: 2.1745559827866363
Number of Clusters: 5 	Average Distortion: 2.131715108373145
Number of Clusters: 6 	Average Distortion: 2.0598763618226803
Number of Clusters: 7 	Average Distortion: 2.01213392603549
Number of Clusters: 8 	Average Distortion: 1.9994328132941857
Number of Clusters: 9 	Average Distortion: 1.9247285037268589
Number of Clusters: 10 	Average Distortion: 1.8539203373868955
Number of Clusters: 11 	Average Distortion: 1.8149636050068858
Number of Clusters: 12 	Average Distortion: 1.7604329006627772
Number of Clusters: 13 	Average Distortion: 1.7489463888565202
Number of Clusters: 14 	Average Distortion: 1.6807389477433747
Observations:
# Visually inspect the relationship between the number of clusters and the distortion to
# help determine the optimal number of clusters. We will look for an "elbow" point in the
# plot of distortion against the number of clusters.
plt.plot(clusters, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average Distortion")
plt.title("Selecting k with the Elbow Method", fontsize=20)
plt.show()
# Plotting distortions with method using KElbowVisualizer to indicate optimal K value
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(1, 15), timings=True)
visualizer.fit(k_means_df) # fit the data to the visualizer
visualizer.show() # finalize and render figure
Observation:
# Find the silhouette score (a measure of how similar an object is to its own cluster compared to other clusters) for different potential clusters
sil_score = []
cluster_list = list(range(2, 15))
for n_clusters in cluster_list:
    clusterer = KMeans(n_clusters=n_clusters)
    preds = clusterer.fit_predict(subset_scaled_df)
    # centers = clusterer.cluster_centers_
    score = silhouette_score(subset_scaled_df, preds)
    sil_score.append(score)
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))
plt.plot(cluster_list, sil_score)
plt.show()
For n_clusters = 2, silhouette score is 0.43969639509980457
For n_clusters = 3, silhouette score is 0.45797710447228496
For n_clusters = 4, silhouette score is 0.45017906939331087
For n_clusters = 5, silhouette score is 0.402932236144343
For n_clusters = 6, silhouette score is 0.4118459845534201
For n_clusters = 7, silhouette score is 0.39397699902210176
For n_clusters = 8, silhouette score is 0.3451545104605459
For n_clusters = 9, silhouette score is 0.41252387300312193
For n_clusters = 10, silhouette score is 0.15248306043525572
For n_clusters = 11, silhouette score is 0.14547527734822369
For n_clusters = 12, silhouette score is 0.109888835910011
For n_clusters = 13, silhouette score is 0.16435888262634815
For n_clusters = 14, silhouette score is 0.15706251995254603
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(2, 15), metric="silhouette", timings=True)
visualizer.fit(k_means_df) # fit the data to the visualizer
visualizer.show() # finalize and render figure
Observations:
# Check for optimal number of clusters with silhouette plots
visualizer = SilhouetteVisualizer(KMeans(3, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
# Check for optimal number of clusters with silhouette plots
visualizer = SilhouetteVisualizer(KMeans(4, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
# Check for optimal number of clusters with silhouette plots
visualizer = SilhouetteVisualizer(KMeans(5, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
We will use 4 as the appropriate number of clusters: the silhouette score for k=4 is still reasonably high (0.450), and there is a clear bend in the elbow curve at k=4.
# final K-means model
kmeans = KMeans(n_clusters=4, random_state=1)
kmeans.fit(k_means_df)
# creating a copy of the original data
df1 = df.copy()
# adding kmeans cluster labels to the original and scaled dataframes
k_means_df["KM_segments"] = kmeans.labels_ # scaled
df1["KM_segments"] = kmeans.labels_ # original
# Group by kmeans cluster labels
km_cluster_profile = df1.groupby("KM_segments").mean(numeric_only=True)  # group by the cluster labels
# Add counts for number of stocks in each cluster
km_cluster_profile["count_in_each_segment"] = df1.groupby("KM_segments")["Security"].count().values
# Display cluster profiles
km_cluster_profile.style.highlight_max(color="lightgreen", axis=0)
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | count_in_each_segment | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| KM_segments | ||||||||||||
| 0 | 72.399112 | 5.066225 | 1.388319 | 34.620939 | 53.000000 | -14046223.826715 | 1482212389.891697 | 3.621029 | 438533835.667184 | 23.843656 | -3.358948 | 277 |
| 1 | 50.517273 | 5.747586 | 1.130399 | 31.090909 | 75.909091 | -1072272727.272727 | 14833090909.090910 | 4.154545 | 4298826628.727273 | 14.803577 | -4.552119 | 11 |
| 2 | 38.099260 | -15.370329 | 2.910500 | 107.074074 | 50.037037 | -159428481.481481 | -3887457740.740741 | -9.473704 | 480398572.845926 | 90.619220 | 1.342067 | 27 |
| 3 | 234.170932 | 13.400685 | 1.729989 | 25.600000 | 277.640000 | 1554926560.000000 | 1572611680.000000 | 6.045200 | 578316318.948800 | 74.960824 | 14.402452 | 25 |
# Print the companies in each cluster
for cl in df1["KM_segments"].unique():
    print("In cluster {}, the following companies are present:".format(cl))
    print(df1[df1["KM_segments"] == cl]["Security"].unique())
    print()
In cluster 0, the following companies are present:
['American Airlines Group', 'AbbVie', 'Abbott Laboratories', 'Adobe Systems Inc', 'Archer-Daniels-Midland Co', ..., 'Xylem Inc.', 'Yum! Brands Inc', 'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis']
Length: 277
Categories (340, object): ['3M Company', 'AFLAC Inc', 'AMETEK Inc', 'AT&T Inc', ...,
'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis', 'eBay Inc.']
In cluster 3, the following companies are present:
['Analog Devices, Inc.', 'Alliance Data Systems', 'Alexion Pharmaceuticals', 'Amgen Inc', 'Amazon.com Inc', ..., 'TripAdvisor', 'Vertex Pharmaceuticals Inc', 'Waters Corporation', 'Wynn Resorts Ltd', 'Yahoo Inc.']
Length: 25
Categories (340, object): ['3M Company', 'AFLAC Inc', 'AMETEK Inc', 'AT&T Inc', ...,
'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis', 'eBay Inc.']
In cluster 2, the following companies are present:
['Apache Corporation', 'Anadarko Petroleum Corp', 'Baker Hughes Inc', 'Chesapeake Energy', 'Cabot Oil & Gas', ..., 'Range Resources Corp.', 'Southwestern Energy', 'Teradata Corp.', 'Williams Cos.', 'Cimarex Energy']
Length: 27
Categories (340, object): ['3M Company', 'AFLAC Inc', 'AMETEK Inc', 'AT&T Inc', ...,
'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis', 'eBay Inc.']
In cluster 1, the following companies are present:
['Citigroup Inc.', 'Ford Motor', 'Gilead Sciences', 'Intel Corp.', 'JPMorgan Chase & Co.', ..., 'Pfizer Inc.', 'AT&T Inc', 'Verizon Communications', 'Wells Fargo', 'Exxon Mobil Corp.']
Length: 11
Categories (340, object): ['3M Company', 'AFLAC Inc', 'AMETEK Inc', 'AT&T Inc', ...,
'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis', 'eBay Inc.']
# Check the distribution of securities across different segments and sectors.
for k in range(0, df1['KM_segments'].nunique()):
    print('\033[1mCluster ' + str(k) + '\033[0m GICS Sector counts:')
    print(df1[df1['KM_segments'] == k]['GICS Sector'].value_counts())
    print(" ")
Cluster 0 GICS Sector counts:
Industrials                    52
Financials                     45
Consumer Discretionary         33
Health Care                    29
Real Estate                    26
Information Technology         24
Utilities                      24
Materials                      19
Consumer Staples               17
Energy                          6
Telecommunications Services     2
Name: GICS Sector, dtype: int64

Cluster 1 GICS Sector counts:
Financials                     3
Health Care                    2
Telecommunications Services    2
Consumer Discretionary         1
Consumer Staples               1
Energy                         1
Information Technology         1
Industrials                    0
Materials                      0
Real Estate                    0
Utilities                      0
Name: GICS Sector, dtype: int64

Cluster 2 GICS Sector counts:
Energy                         22
Information Technology          3
Industrials                     1
Materials                       1
Consumer Discretionary          0
Consumer Staples                0
Financials                      0
Health Care                     0
Real Estate                     0
Telecommunications Services     0
Utilities                       0
Name: GICS Sector, dtype: int64

Cluster 3 GICS Sector counts:
Health Care                    9
Consumer Discretionary         6
Information Technology         5
Consumer Staples               1
Energy                         1
Financials                     1
Real Estate                    1
Telecommunications Services    1
Industrials                    0
Materials                      0
Utilities                      0
Name: GICS Sector, dtype: int64
# generate boxplots for numerical variables within each cluster
plt.figure(figsize=(18, 18))
plt.suptitle("Boxplot of numerical variables for each cluster")

# selecting numerical columns
num_col = df.select_dtypes(include=np.number).columns.tolist()

for i, variable in enumerate(num_col):
    plt.subplot(3, 4, i + 1)
    sns.boxplot(data=df1, x="KM_segments", y=variable)

plt.tight_layout(pad=2.0)
# check the mean values of each numerical column for each cluster
cluster_profile = df1.groupby("KM_segments").mean()
cluster_profile.style.highlight_max(color="lightgreen", axis=0)
| KM_segments | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 72.399112 | 5.066225 | 1.388319 | 34.620939 | 53.000000 | -14046223.826715 | 1482212389.891697 | 3.621029 | 438533835.667184 | 23.843656 | -3.358948 |
| 1 | 50.517273 | 5.747586 | 1.130399 | 31.090909 | 75.909091 | -1072272727.272727 | 14833090909.090910 | 4.154545 | 4298826628.727273 | 14.803577 | -4.552119 |
| 2 | 38.099260 | -15.370329 | 2.910500 | 107.074074 | 50.037037 | -159428481.481481 | -3887457740.740741 | -9.473704 | 480398572.845926 | 90.619220 | 1.342067 |
| 3 | 234.170932 | 13.400685 | 1.729989 | 25.600000 | 277.640000 | 1554926560.000000 | 1572611680.000000 | 6.045200 | 578316318.948800 | 74.960824 | 14.402452 |
# visualize the mean values of each numerical column for each cluster
df1.groupby("KM_segments").mean().plot.bar(figsize=(10, 6))
<Axes: xlabel='KM_segments'>
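One caveat on the cluster-profile step: from pandas 2.0 onward, `groupby(...).mean()` raises a `TypeError` when the frame still contains non-numeric columns such as `Security` or `GICS Sector`, so the call may need `numeric_only=True`. A small sketch on toy data:

```python
import pandas as pd

# toy frame with one numeric and one string column per cluster label
toy = pd.DataFrame({
    "KM_segments": [0, 0, 1],
    "Current Price": [10.0, 20.0, 30.0],
    "Security": ["A", "B", "C"],  # non-numeric column
})

# numeric_only=True makes the mean skip string columns instead of raising
profile = toy.groupby("KM_segments").mean(numeric_only=True)
print(profile)
```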
Cluster 0 (277 stocks):
Cluster 1 (11 stocks):
Cluster 2 (27 stocks):
Cluster 3 (25 stocks):
hc_df = subset_scaled_df.copy()

# imports for hierarchical clustering utilities
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# list of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]

# list of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]

high_cophenet_corr = 0
high_dm_lm = [0, 0]

for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(hc_df, metric=dm, method=lm)
        c, coph_dists = cophenet(Z, pdist(hc_df))
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}.".format(
                dm.capitalize(), lm, c
            )
        )
        if high_cophenet_corr < c:
            high_cophenet_corr = c
            high_dm_lm[0] = dm
            high_dm_lm[1] = lm

# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print('*' * 100)
print(
    "Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
        high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
    )
)
Cophenetic correlation for Euclidean distance and single linkage is 0.9232271494002922.
Cophenetic correlation for Euclidean distance and complete linkage is 0.7873280186580672.
Cophenetic correlation for Euclidean distance and average linkage is 0.9422540609560814.
Cophenetic correlation for Euclidean distance and weighted linkage is 0.8693784298129404.
Cophenetic correlation for Chebyshev distance and single linkage is 0.9062538164750717.
Cophenetic correlation for Chebyshev distance and complete linkage is 0.598891419111242.
Cophenetic correlation for Chebyshev distance and average linkage is 0.9338265528030499.
Cophenetic correlation for Chebyshev distance and weighted linkage is 0.9127355892367.
Cophenetic correlation for Mahalanobis distance and single linkage is 0.925919553052459.
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.7925307202850002.
Cophenetic correlation for Mahalanobis distance and average linkage is 0.9247324030159736.
Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.8708317490180428.
Cophenetic correlation for Cityblock distance and single linkage is 0.9334186366528574.
Cophenetic correlation for Cityblock distance and complete linkage is 0.7375328863205818.
Cophenetic correlation for Cityblock distance and average linkage is 0.9302145048594667.
Cophenetic correlation for Cityblock distance and weighted linkage is 0.731045513520281.
****************************************************************************************************
Highest cophenetic correlation is 0.9422540609560814, which is obtained with Euclidean distance and average linkage.
Observations:
The highest cophenetic correlation (0.9423) is obtained with Euclidean distance and average linkage.
Next, we explore different linkage methods with Euclidean distance only.
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]

high_cophenet_corr = 0
high_dm_lm = [0, 0]

for lm in linkage_methods:
    Z = linkage(hc_df, metric="euclidean", method=lm)
    c, coph_dists = cophenet(Z, pdist(hc_df))
    print("Cophenetic correlation for {} linkage is {}.".format(lm, c))
    if high_cophenet_corr < c:
        high_cophenet_corr = c
        high_dm_lm[0] = "euclidean"
        high_dm_lm[1] = lm

# printing the linkage method with the highest cophenetic correlation
print('*' * 100)
print(
    "Highest cophenetic correlation is {}, which is obtained with {} linkage.".format(
        high_cophenet_corr, high_dm_lm[1]
    )
)
Cophenetic correlation for single linkage is 0.9232271494002922.
Cophenetic correlation for complete linkage is 0.7873280186580672.
Cophenetic correlation for average linkage is 0.9422540609560814.
Cophenetic correlation for centroid linkage is 0.9314012446828154.
Cophenetic correlation for ward linkage is 0.7101180299865353.
Cophenetic correlation for weighted linkage is 0.8693784298129404.
****************************************************************************************************
Highest cophenetic correlation is 0.9422540609560814, which is obtained with average linkage.
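As a sanity check, the cophenetic correlation computation can be reproduced on synthetic data, where two well-separated blobs should yield a coefficient close to 1. A self-contained sketch (the blob parameters are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
# two well-separated 2-D blobs of 20 points each
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])

# average linkage; cophenet compares dendrogram distances to raw pairwise distances
Z = linkage(X, metric="euclidean", method="average")
c, _ = cophenet(Z, pdist(X))
print(round(c, 3))
```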
Dendrograms for the different linkage methods with Euclidean distance.
# import for dendrogram plotting
from scipy.cluster.hierarchy import dendrogram

# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]

# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
compare = []

# to create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))

# for each linkage method, plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(hc_df, metric="euclidean", method=method)
    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")
    coph_corr, coph_dist = cophenet(Z, pdist(hc_df))
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
    compare.append([method, coph_corr])
# create and print a dataframe to compare cophenetic correlations for different linkage methods
df_cc = pd.DataFrame(compare, columns=compare_cols)
df_cc = df_cc.sort_values(by="Cophenetic Coefficient")
df_cc
| | Linkage | Cophenetic Coefficient |
|---|---|---|
| 4 | ward | 0.710118 |
| 1 | complete | 0.787328 |
| 5 | weighted | 0.869378 |
| 0 | single | 0.923227 |
| 3 | centroid | 0.931401 |
| 2 | average | 0.942254 |
# import for agglomerative clustering
from sklearn.cluster import AgglomerativeClustering

HCmodel = AgglomerativeClustering(n_clusters=4, affinity="euclidean", linkage="average")
HCmodel.fit(hc_df)
AgglomerativeClustering(affinity='euclidean', linkage='average', n_clusters=4)
# creating a copy of the original data
df2 = df.copy()
# adding hierarchical cluster labels to the original and scaled dataframes
hc_df["HC_Clusters"] = HCmodel.labels_
df2["HC_Clusters"] = HCmodel.labels_
cluster_profile = df2.groupby("HC_Clusters").mean()
cluster_profile["count_in_each_segments"] = (
    df2.groupby("HC_Clusters")["Current Price"].count().values
)
cluster_profile.style.highlight_max(color="lightgreen", axis=0)
| HC_Clusters | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | count_in_each_segments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 77.573266 | 4.148438 | 1.515708 | 35.184524 | 67.154762 | 67104693.452381 | 1607391086.309524 | 2.905640 | 572317821.413095 | 32.325679 | -1.762402 | 336 |
| 1 | 1274.949951 | 3.190527 | 1.268340 | 29.000000 | 184.000000 | -1671386000.000000 | 2551360000.000000 | 50.090000 | 50935516.070000 | 25.453183 | -1.052429 | 1 |
| 2 | 24.485001 | -13.351992 | 3.482611 | 802.000000 | 51.000000 | -1292500000.000000 | -19106500000.000000 | -41.815000 | 519573983.250000 | 60.748608 | 1.565141 | 2 |
| 3 | 104.660004 | 16.224320 | 1.320606 | 8.000000 | 958.000000 | 592000000.000000 | 3669000000.000000 | 1.310000 | 2800763359.000000 | 79.893133 | 5.884467 | 1 |
# let's see the names of the securities in each cluster
for cl in df2["HC_Clusters"].unique():
    print("In cluster {}, the following securities are present:".format(cl))
    print(df2[df2["HC_Clusters"] == cl]["Security"].unique())
    print()
In cluster 0, the following securities are present:
['American Airlines Group', 'AbbVie', 'Abbott Laboratories', 'Adobe Systems Inc', 'Analog Devices, Inc.', ..., 'Yahoo Inc.', 'Yum! Brands Inc', 'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis']
Length: 336
Categories (340, object): ['3M Company', 'AFLAC Inc', 'AMETEK Inc', 'AT&T Inc', ...,
'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis', 'eBay Inc.']
In cluster 2, the following securities are present:
['Apache Corporation', 'Chesapeake Energy']
Categories (340, object): ['3M Company', 'AFLAC Inc', 'AMETEK Inc', 'AT&T Inc', ...,
'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis', 'eBay Inc.']
In cluster 3, the following securities are present:
['Facebook']
Categories (340, object): ['3M Company', 'AFLAC Inc', 'AMETEK Inc', 'AT&T Inc', ...,
'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis', 'eBay Inc.']
In cluster 1, the following securities are present:
['Priceline.com Inc']
Categories (340, object): ['3M Company', 'AFLAC Inc', 'AMETEK Inc', 'AT&T Inc', ...,
'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis', 'eBay Inc.']
# df2.groupby(["HC_Clusters", "GICS Sector"])['Security'].count()
# print the number of stocks in each GICS sector for each cluster
for k in range(0, df2['HC_Clusters'].nunique()):
    print('\033[1mCluster ' + str(k) + '\033[0m GICS Sector stock counts:')
    print(df2[df2['HC_Clusters'] == k]['GICS Sector'].value_counts())
    print(" ")
Cluster 0 GICS Sector stock counts:
Industrials                    53
Financials                     49
Health Care                    40
Consumer Discretionary         39
Information Technology         32
Energy                         28
Real Estate                    27
Utilities                      24
Materials                      20
Consumer Staples               19
Telecommunications Services     5
Name: GICS Sector, dtype: int64

Cluster 1 GICS Sector stock counts:
Consumer Discretionary         1
Consumer Staples               0
Energy                         0
Financials                     0
Health Care                    0
Industrials                    0
Information Technology         0
Materials                      0
Real Estate                    0
Telecommunications Services    0
Utilities                      0
Name: GICS Sector, dtype: int64

Cluster 2 GICS Sector stock counts:
Energy                         2
Consumer Discretionary         0
Consumer Staples               0
Financials                     0
Health Care                    0
Industrials                    0
Information Technology         0
Materials                      0
Real Estate                    0
Telecommunications Services    0
Utilities                      0
Name: GICS Sector, dtype: int64

Cluster 3 GICS Sector stock counts:
Information Technology         1
Consumer Discretionary         0
Consumer Staples               0
Energy                         0
Financials                     0
Health Care                    0
Industrials                    0
Materials                      0
Real Estate                    0
Telecommunications Services    0
Utilities                      0
Name: GICS Sector, dtype: int64
# boxplot of numerical variables for each cluster
fig, axes = plt.subplots(3, 4, figsize=(18, 18))
counter = 0

for ii in range(3):
    for jj in range(4):
        if counter < 11:
            sns.boxplot(
                ax=axes[ii][jj],
                data=df2,
                y=df2.columns[3 + counter],
                x="HC_Clusters",
                palette="coolwarm",
            )
        counter = counter + 1

fig.tight_layout(pad=3.0)
df2.groupby("HC_Clusters").mean().plot.bar(figsize=(10,6))
<Axes: xlabel='HC_Clusters'>
Cluster 0 (336 securities):
Cluster 1 (1 security):
Cluster 2 (2 securities):
Cluster 3 (1 security):
Which clustering technique took less time for execution?
Which clustering technique gave you more distinct clusters, or are they the same?
How many observations are there in the similar clusters of both algorithms?
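The execution-time question can be answered empirically by timing both fits with `time.perf_counter`. A sketch on synthetic data of roughly the notebook's shape (340 rows, 11 scaled features); the actual timings will vary by machine:

```python
import time
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(42)
X = rng.normal(size=(340, 11))  # stand-in for the scaled feature matrix

t0 = time.perf_counter()
KMeans(n_clusters=4, n_init=10, random_state=1).fit(X)
km_time = time.perf_counter() - t0

t0 = time.perf_counter()
AgglomerativeClustering(n_clusters=4, linkage="average").fit(X)
hc_time = time.perf_counter() - t0

print(f"KMeans: {km_time:.4f}s, Agglomerative: {hc_time:.4f}s")
```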
With both algorithms, Clusters 1, 2, and 3 were highly specialized compared to the much larger Cluster 0.
In KMeans clustering:
In Agglomerative clustering:
Here is a quick summary of the four clusters and their investment profiles:
Cluster 0:
Cluster 1:
Cluster 2:
Cluster 3:
General Insights
Please note:
As always, investors should consider their risk tolerance, investment goals, and sector preferences while evaluating these clusters for investment opportunities. Additionally, further analysis of individual stocks within each cluster is recommended for more informed decision-making.